Learn how to impute missing values, convert categorical data to numeric values, scale data, evaluate multiple supervised learning models simultaneously, and build pipelines to streamline your workflow!
This is my learning experience of data science through DataCamp
Exploring categorical features
Gapminder datasets that you worked with in previous chapters also contained a categorical ‘Region’ feature, which we dropped since we didn’t have the tools. We have added it back in now that you know about it!
Code
import matplotlib.pyplot as plt# Import pandasimport pandas as pd# Read 'gapminder.csv' into a DataFrame: dfdf = pd.read_csv("gm_2008_region.csv")# Create a boxplot of life expectancy per regiondf.boxplot('life', 'Region', rot=60)# Show the plotplt.show()
Creating dummy variables
Scikit-learn does not accept non-numerical features. Earlier we have learned that the ‘Region’ feature contains useful information for predicting life expectancy. Compared to Europe and Central Asia, Sub-Saharan Africa has a lower life expectancy. Thus, retaining the ‘Region’ feature is preferable if we are trying to predict life expectancy. In this exercise, we
Code
# Create dummy variables: df_regiondf_region = pd.get_dummies(df)# Print the columns of df_regionprint(df_region.columns)print(df.sample(10))# Create dummy variables with drop_first=True: df_regiondf_region = pd.get_dummies(df,drop_first=True)# Print the new columns of df_regionprint(df_region.columns)print(df.sample(10))
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP',
'BMI_female', 'life', 'child_mortality', 'Region_America',
'Region_East Asia & Pacific', 'Region_Europe & Central Asia',
'Region_Middle East & North Africa', 'Region_South Asia',
'Region_Sub-Saharan Africa'],
dtype='object')
population fertility HIV CO2 BMI_male GDP BMI_female \
106 143123163.0 1.49 1.0 11.982718 26.01131 22506.0 128.4903
79 406392.0 1.38 0.1 6.182771 27.68361 27872.0 124.1571
121 9226333.0 1.92 0.1 5.315688 26.37629 43421.0 122.9473
109 9109535.0 1.41 0.1 5.271223 26.51495 12522.0 130.3755
37 6004199.0 2.32 0.8 1.067765 26.36751 7450.0 119.9321
89 16519862.0 1.77 0.2 10.533028 26.01541 47388.0 121.6950
30 19261647.0 4.91 3.7 0.361897 22.56469 2854.0 131.5237
83 4111168.0 1.49 0.4 1.313321 24.23690 3890.0 129.9424
80 3414552.0 4.94 0.7 0.613104 22.62295 3356.0 129.9875
76 27197419.0 2.05 0.5 7.752234 24.73069 19968.0 123.8593
life child_mortality Region
106 67.6 13.5 Europe & Central Asia
79 81.4 6.6 Europe & Central Asia
121 81.1 3.2 Europe & Central Asia
109 76.4 8.0 Europe & Central Asia
37 74.1 21.6 America
89 80.3 4.8 Europe & Central Asia
30 55.9 116.9 Sub-Saharan Africa
83 69.6 17.6 Europe & Central Asia
80 63.6 103.0 Sub-Saharan Africa
76 74.5 8.0 East Asia & Pacific
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP',
'BMI_female', 'life', 'child_mortality', 'Region_East Asia & Pacific',
'Region_Europe & Central Asia', 'Region_Middle East & North Africa',
'Region_South Asia', 'Region_Sub-Saharan Africa'],
dtype='object')
population fertility HIV CO2 BMI_male GDP BMI_female \
69 4109389.0 1.57 0.10 3.996722 27.20117 14158.0 127.5037
103 10577458.0 1.36 0.50 5.486926 26.68445 27747.0 127.2631
61 4480145.0 2.00 0.20 9.882531 27.65325 47713.0 124.7801
79 406392.0 1.38 0.10 6.182771 27.68361 27872.0 124.1571
114 9132589.0 7.06 0.60 0.068219 21.96917 615.0 131.5318
22 19570418.0 5.17 5.30 0.295542 23.68173 2571.0 127.2823
128 10408091.0 2.04 0.06 2.440669 25.15699 9938.0 128.6291
110 5521838.0 5.13 1.60 0.118256 22.53139 1289.0 134.7160
137 13114579.0 5.88 13.60 0.148982 20.68321 3039.0 132.4493
56 10050699.0 1.33 0.06 5.453232 27.11568 23334.0 128.6968
life child_mortality Region
69 77.6 11.3 Middle East & North Africa
103 79.2 4.1 Europe & Central Asia
61 79.4 4.5 Europe & Central Asia
79 81.4 6.6 Europe & Central Asia
114 56.7 168.5 Sub-Saharan Africa
22 56.6 113.8 Sub-Saharan Africa
128 76.5 19.4 Middle East & North Africa
110 55.9 179.1 Sub-Saharan Africa
137 52.0 94.9 Sub-Saharan Africa
56 73.8 7.2 Europe & Central Asia
Using categorical features in regression
We can now build regression models using the dummy variables we have created using the ‘Region’ feature. We will perform 5-fold cross-validation here using ridge regression.
Code
import pandas as pdimport numpy as npdf=pd.read_csv('gm_2008_region.csv')df_region=pd.get_dummies(df)df_region=df_region.drop('Region_America',axis=1)X=df_region.drop('life',axis=1).valuesy=df_region['life'].values
C:\Users\dghr201\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py:141: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
C:\Users\dghr201\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py:141: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
C:\Users\dghr201\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py:141: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
C:\Users\dghr201\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py:141: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
C:\Users\dghr201\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py:141: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
Dropping missing data
Earlier excercise voting dataset contained a bunch of missing values that we handled behind the scenes. Now it’s your turn!
DataFrame df loaded with unprocessed dataset. Use the .head() method in the IPython Shell. Some data points are labeled with ‘?’. Missing values are here. Different datasets encode missing values differently. Data in real life can be messy - sometimes a 9999, sometimes a 0. It’s possible that the missing values are already encoded as NaN. By using NaN, we can take advantage of pandas methods like .dropna() and .fillna(), as well as scikit-learn’s Imputation transformer Imputer().
This exercise requires you to convert ’? to NaNs, and then drop the rows that contain them.
# Convert '?' to NaNdf[df =='?'] = np.nan# Print the number of NaNsprint(df.isnull().sum())# Print shape of original DataFrameprint("Shape of Original DataFrame: {}".format(df.shape))# Drop missing values and print shape of new DataFramedf = df.dropna()# Print shape of new DataFrameprint("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
party 0
infants 12
water 48
budget 11
physician 11
salvador 15
religious 11
satellite 14
aid 15
missile 22
immigration 7
synfuels 21
education 31
superfund 25
crime 17
duty_free_exports 28
eaa_rsa 104
dtype: int64
Shape of Original DataFrame: (435, 17)
Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)
When many values in your dataset are missing, if you drop them, you may end up throwing away valuable information along with the missing data. It’s better instead to develop an imputation strategy. This is where domain knowledge is useful, but in the absence of it, you can impute missing values with the mean or the median of the row or column that the missing value is in
Imputing missing data in a ML Pipeline I
The process of building a model involves many steps, such as creating training and test sets, fitting a classifier or regressor, tuning its parameters, and evaluating its performance. In this machine learning process, imputation is the first step, which is viewed as part of a pipeline. With Scikit-learn, we can piece together these steps into one process and simplify your workflow.
We will now practice setting up a pipeline with two steps: imputation and classifier instantiation. We have so far tried three classifiers: k-NN, logistic regression, and decision trees. The Support Vector Machine, or SVM, is the fourth. Don’t worry about how it works. As with the scikit-learn estimators we have worked with before, it has the same .fit() and .predict() methods.
Code
# Import the Imputer modulefrom sklearn.impute import SimpleImputerfrom sklearn.svm import SVC# Setup the Imputation transformer: impimp = SimpleImputer(missing_values='NaN', strategy='most_frequent')# Instantiate the SVC classifier: clfclf = SVC()# Setup the pipeline with the required steps: stepssteps = [('imputation', imp), ('SVM', clf)]
Imputing missing data in a ML Pipeline II
Having setup the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman’s party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the .fit() and .predict() methods on pipelines just as you did with your classifiers and regressors!
Code
import pandas as pdimport numpy as np#df = pd.read_csv('votes-ch1.csv')# Create arrays for the features and the response variable. As a reminder, the response variable is 'party'y = df['party'].valuesX = df.drop('party', axis=1).valuesfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import classification_report# Import necessary modulesfrom sklearn.impute import SimpleImputerfrom sklearn.pipeline import Pipelinefrom sklearn.svm import SVC# Setup the pipeline steps: stepssteps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')), ('SVM', SVC())]# Create the pipeline: pipelinepipeline = Pipeline(steps)# Create training and test setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)# Fit the pipeline to the train setpipeline.fit(X_train, y_train)# Predict the labels of the test sety_pred = pipeline.predict(X_test)# Compute metricsprint(classification_report(y_test, y_pred))
In the video, Hugo demonstrated how significantly the performance of a model can improve if the features are scaled. Note that this is not always the case: In the Congressional voting records dataset, for example, all of the features are binary. In such a situation, scaling will have minimal impact.
You will now explore scaling for yourself on a new dataset - White Wine Quality! Hugo used the Red Wine Quality dataset in the video. We have used the ‘quality’ feature of the wine to create a binary target variable: If ’quality’is less than 5, the target variable is 1, and otherwise, it is 0.
The DataFrame has been pre-loaded as df, along with the feature and target variable arrays X and y. Explore it in the IPython Shell. Notice how some features seem to have different units of measurement. ‘density’, for instance, takes values between 0.98 and 1.04, while ‘total sulfur dioxide’ ranges from 9 to 440. As a result, it may be worth scaling the features here. Your job in this exercise is to scale the features and compute the mean and standard deviation of the unscaled features compared to the scaled features.
Code
df = pd.read_csv('white-wine.csv')X = df.drop('quality' , 1).values # drop target variabley1 = df['quality'].valuesy = y1 <=5# Import scalefrom sklearn.preprocessing import scale# Scale the features: X_scaledX_scaled = scale(X)# Print the mean and standard deviation of the unscaled featuresprint("Mean of Unscaled Features: {}".format(np.mean(X)))print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))# Print the mean and standard deviation of the scaled featuresprint("Mean of Scaled Features: {}".format(np.mean(X_scaled)))print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))
Mean of Unscaled Features: 18.432687072460002
Standard Deviation of Unscaled Features: 41.54494764094571
Mean of Scaled Features: 2.7452128118308485e-15
Standard Deviation of Scaled Features: 0.9999999999999999
C:\Users\dghr201\AppData\Local\Temp\ipykernel_23428\525489148.py:2: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
X = df.drop('quality' , 1).values # drop target variable
Centering and scaling in a pipeline
With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. You will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.
Code
# modified/added by Jinnyimport pandas as pddf = pd.read_csv('white-wine.csv')X = df.drop('quality' , 1).values # drop target variabley1 = df['quality'].valuesy = y1 <=5from sklearn.model_selection import train_test_splitfrom sklearn.neighbors import KNeighborsClassifier# Import the necessary modulesfrom sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipeline# Setup the pipeline steps: stepssteps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]# Create the pipeline: pipelinepipeline = Pipeline(steps)# Create train and test setsX_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)# Fit the pipeline to the training set: knn_scaledknn_scaled = pipeline.fit(X_train,y_train)# Instantiate and fit a k-NN classifier to the unscaled dataknn_unscaled = KNeighborsClassifier().fit(X_train, y_train)# Compute and print metricsprint('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
C:\Users\dghr201\AppData\Local\Temp\ipykernel_23428\594620727.py:4: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
X = df.drop('quality' , 1).values # drop target variable
Accuracy with Scaling: 0.7700680272108843
Accuracy without Scaling: 0.6979591836734694
Fantastic! It looks like scaling has significantly improved model performance!
Bringing it all together I: Pipeline for classification
It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.
You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are C and gamma. C controls the regularization strength. It is analogous to the you tuned for logistic regression in Chapter 3, while gammagamma controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.
C:\Users\dghr201\AppData\Local\Temp\ipykernel_23428\2362924875.py:3: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
X = df.drop('quality' , 1).values # drop target variable
Code
# Setup the pipelinesteps = [('scaler', StandardScaler()), ('SVM', SVC())]pipeline = Pipeline(steps)# Specify the hyperparameter spaceparameters = {'SVM__C':[1, 10, 100],'SVM__gamma':[0.1, 0.01]}# Create train and test setsX_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=21)# Instantiate the GridSearchCV object: cvcv = GridSearchCV(pipeline,parameters,cv=3)# Fit to the training setcv.fit(X_train,y_train)# Predict the labels of the test set: y_predy_pred = cv.predict(X_test)# Compute and print metricsprint("Accuracy: {}".format(cv.score(X_test, y_test)))print(classification_report(y_test, y_pred))print("Tuned Model Parameters: {}".format(cv.best_params_))
Bringing it all together II: Pipeline for regression
For this final exercise, you will return to the Gapminder dataset. Guess what? Even this dataset has missing values that we dealt with for you in earlier chapters! Now, you have all the tools to take care of them yourself!
Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the l1_ratio of your ElasticNet using GridSearchCV.
# Setup the pipeline steps: steps# modified by Jinny: Imputer -> SimpleImputersteps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')), ('scaler', StandardScaler()), ('elasticnet', ElasticNet())]# Create the pipeline: pipelinepipeline = Pipeline(steps)# Specify the hyperparameter spaceparameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}# Create train and test setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)# Create the GridSearchCV object: gm_cvgm_cv = GridSearchCV(pipeline, parameters, cv=3)# Fit to the training setgm_cv.fit(X_train, y_train)# Compute and print the metricsr2 = gm_cv.score(X_test, y_test)print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))print("Tuned ElasticNet R squared: {}".format(r2))